NSF PAR Search | NSF Public Access Repository

Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly

https://doi.org/10.1145/3629526.3653835

Hanindhito, Bagus; John, Lizy K (May 2024, ACM)

Full Text Available
Bandwidth Characterization of DeepSpeed on Distributed Large Language Model Training

https://doi.org/10.1109/ISPASS61541.2024.00031

Hanindhito, Bagus; Patel, Bhavesh; John, Lizy K (May 2024, IEEE)

Full Text Available
SACHI: A Stationarity-Aware, All-Digital, Near-Memory, Ising Architecture

https://doi.org/10.1109/HPCA57654.2024.00061

Sundara Raman, Siddhartha Raman; John, Lizy K; Kulkarni, Jaydeep P (March 2024, IEEE)

Full Text Available
Dendrite-inspired Computing to Improve Resilience of Neural Networks to Faults in Emerging Memory Technologies

https://doi.org/10.1109/ICRC60800.2023.10386729

John, Lizy K; França, Felipe M G; Mitra, Subhasish; Susskind, Zachary; Lima, Priscila M V; Miranda, Igor D S; John, Eugene B; Dutra, Diego L C; Breternitz, Mauricio (December 2023, IEEE)

Full Text Available
CoMeFa: Deploying Compute-in-Memory on FPGAs for Deep Learning Acceleration

https://doi.org/10.1145/3603504

Arora, Aman; Bhamburkar, Atharva; Borda, Aatman; Anand, Tanmay; Sehgal, Rishabh; Hanindhito, Bagus; Gaillardon, Pierre-Emmanuel; Kulkarni, Jaydeep; John, Lizy K. (July 2023, ACM Transactions on Reconfigurable Technology and Systems)

Block random access memories (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using logic blocks and digital signal processing slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-in-Memory Blocks for FPGAs) random access memories (RAMs). These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple configurable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute with any precision, which is extremely important for applications like deep learning (DL). Adding CoMeFa RAMs to FPGAs significantly increases their compute density while also reducing data movement. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Unlike existing proposals, CoMeFa RAMs do not require changes to the underlying static RAM technology, such as simultaneously activating multiple wordlines on the same port, and are practical to implement. CoMeFa RAMs are especially suitable for parallel and compute-intensive applications like DL, but these versatile blocks are also useful in diverse domains such as signal processing and databases, among others. By augmenting an Intel Arria 10–like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55× (1.85×) across microbenchmarks from various applications and a geomean speedup of up to 2.5× across multiple deep neural networks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of DL workloads.
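
The abstract's point that single-bit, bit-serial processing elements can compute at any precision can be illustrated with a minimal Python sketch. This is a generic model of bit-serial addition, not the CoMeFa design itself: the function name `bit_serial_add`, the per-element carry latch, and the one-bit-position-per-cycle schedule are assumptions made for illustration.

```python
def bit_serial_add(a_vals, b_vals, precision):
    """Add operand pairs element-wise, processing one bit position per cycle.

    Hypothetical model: each list index stands for one single-bit processing
    element (PE) that holds its own carry between cycles.
    """
    n = len(a_vals)
    carry = [0] * n    # one carry latch per PE (an assumption of this sketch)
    result = [0] * n
    for bit in range(precision):      # one "cycle" per bit position
        for pe in range(n):           # in hardware, all PEs would work in parallel
            a = (a_vals[pe] >> bit) & 1
            b = (b_vals[pe] >> bit) & 1
            s = a ^ b ^ carry[pe]                        # full-adder sum bit
            carry[pe] = (a & b) | (carry[pe] & (a ^ b))  # full-adder carry out
            result[pe] |= s << bit
    # The final carry becomes the extra most-significant output bit.
    return [r | (c << precision) for r, c in zip(result, carry)]


sums = bit_serial_add([3, 5, 7], [1, 6, 7], precision=3)
print(sums)
```

Changing `precision` only changes the number of cycles, which is why a bit-serial element is not tied to a fixed operand width.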
Full Text Available
ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks

https://doi.org/10.1145/3629522

Susskind, Zachary; Arora, Aman; Miranda, Igor D S; Bacellar, Alan T L; Villon, Luis A Q; Katopodis, Rafael F; de Araújo, Leandro S; Dutra, Diego L C; Lima, Priscila M V; França, Felipe M G; et al. (December 2023, ACM Transactions on Architecture and Code Optimization)

“Extreme edge” devices, such as smart sensors, are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well-suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0–14.3 million inferences per second, improving area-normalized throughput by an average of 3.6× and steady-state energy efficiency by an average of 7.1× compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.
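
To make the idea of inference by table lookup concrete, here is a minimal Python sketch of a classic weightless classifier. It is not ULEEN's model or training strategy: the class `TinyWNN`, its methods, and the fixed (non-randomized) grouping of input bits are hypothetical simplifications chosen for illustration.

```python
class TinyWNN:
    """Hypothetical weightless classifier: per-class lookup tables, no multiplies."""

    def __init__(self, num_inputs, tuple_size):
        assert num_inputs % tuple_size == 0
        self.tuple_size = tuple_size
        self.num_tables = num_inputs // tuple_size
        self.tables = {}  # label -> list of sets of addresses seen in training

    def _addresses(self, bits):
        # Each group of tuple_size input bits forms the address into one table.
        addrs = []
        for i in range(self.num_tables):
            chunk = bits[i * self.tuple_size:(i + 1) * self.tuple_size]
            addr = 0
            for b in chunk:
                addr = (addr << 1) | b
            addrs.append(addr)
        return addrs

    def train(self, bits, label):
        # Training only records which addresses occurred for this class.
        tables = self.tables.setdefault(
            label, [set() for _ in range(self.num_tables)]
        )
        for table, addr in zip(tables, self._addresses(bits)):
            table.add(addr)

    def predict(self, bits):
        # Inference is one lookup per table plus a count of the hits.
        addrs = self._addresses(bits)
        scores = {
            label: sum(addr in table for table, addr in zip(tables, addrs))
            for label, tables in self.tables.items()
        }
        return max(scores, key=scores.get)


model = TinyWNN(num_inputs=4, tuple_size=2)
model.train([1, 1, 0, 0], "a")
model.train([0, 0, 1, 1], "b")
print(model.predict([1, 1, 0, 1]))
```

The only operations at inference time are address formation, membership lookups, and a count, which is the property the abstract credits for the low energy cost.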
Full Text Available
Tensor Slices: FPGA Building Blocks For The Deep Learning Era

https://doi.org/10.1145/3529650

Arora, Aman; Ghosh, Moinak; Mehta, Samidh; Betz, Vaughn; John, Lizy K. (December 2022, ACM Transactions on Reconfigurable Technology and Systems)

FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed, including adding specialized artificial intelligence (AI) processing engines, adding support for lower-precision math such as 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, and adding shadow multipliers in logic blocks. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have at their heart a systolic array of processing elements; they support multiple tensor operations and multiple dynamically selectable precisions, and can be dynamically fractured into individual multipliers and multiply-and-accumulate (MAC) units. A local crossbar at the inputs of each slice helps ease the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine grain.
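
As a rough illustration of how a systolic array of multiply-and-accumulate elements computes a matrix product, the Python sketch below models the wavefront timing of an output-stationary array. The dataflow, the function name `systolic_matmul`, and the cycle schedule are assumptions of this sketch; the abstract does not specify how the Tensor Slice array is organized.

```python
def systolic_matmul(A, B):
    """Functional model of an output-stationary systolic array (assumed dataflow).

    PE (i, j) keeps C[i][j] in place. With skewed inputs, the operand pair
    A[i][k], B[k][j] reaches PE (i, j) at cycle t = i + j + k.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for t in range(M + N + K - 2):        # cycles until the last wavefront
        for i in range(M):
            for j in range(N):
                k = t - i - j             # operand index arriving this cycle
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]  # one MAC per PE per cycle
    return C


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each processing element performs at most one MAC per cycle and only ever touches its own accumulator, which is what lets such an array be laid out as a dense, regular hard block.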
Full Text Available
Koios 2.0: Open-Source Deep Learning Benchmarks for FPGA Architecture and CAD Research

https://doi.org/10.1109/TCAD.2023.3272582

Arora, Aman; Boutros, Andrew; Damghani, Seyed Alireza; Mathur, Karan; Mohanty, Vedant; Anand, Tanmay; Elgammal, Mohamed A.; Kent, Kenneth B.; Betz, Vaughn; John, Lizy K. (May 2023, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
GAPS: GPU-acceleration of PDE solvers for wave simulation

https://doi.org/10.1145/3524059.3532373

Hanindhito, Bagus; Gourounas, Dimitrios; Fathi, Arash; Trenev, Dimitar; Gerstlauer, Andreas; John, Lizy K. (June 2022, ACM International Conference on Supercomputing (ICS))

Full Text Available
Hardware-aware 3D Model Workload Selection and Characterization for Graphics and ML Applications

https://doi.org/10.1109/ISQED54688.2022.9806296

Li, Ruihao; Arora, Aman; Li, Sikan; Wu, Qinzhe; John, Lizy K. (April 2022, International Symposium on Quality Electronic Design (ISQED))

Full Text Available
